On Finite Memory Universal Data Compression and 
Classification of Individual Sequences 



Jacob Ziv 
Department of Electrical Engineering 
Technion-Israel Institute of Technology 
Haifa 32000, Israel 



September 28, 2007 



Abstract 

Consider the case where consecutive blocks of N letters of a semi-infinite individual
sequence X over a finite alphabet are being compressed into binary sequences by some
one-to-one mapping. No a-priori information about X is available at the encoder, which
must therefore adopt a universal data-compression algorithm.

It is known that if the universal LZ77 data-compression algorithm is successively
applied to N-blocks, then the best error-free compression for the particular individual
sequence X is achieved as N tends to infinity.

The best possible compression that may be achieved by any universal data-compression
algorithm for finite N-blocks is discussed. It is demonstrated that context-tree
coding essentially achieves it.

Next, consider a device called a classifier (or discriminator) that observes an individual
training sequence X. The classifier's task is to examine individual test sequences of
length N and decide whether the test N-sequence has the same features as those that
are captured by the training sequence X, or is sufficiently different, according to some
appropriate criterion. Here again, it is demonstrated that a particular universal context
classifier, with a storage-space complexity that is linear in N, is essentially optimal. This
may contribute a theoretical "individual sequence" justification for the Probabilistic
Suffix Tree (PST) approach in learning theory and in computational biology.

Index Terms: Data compression, universal compression, universal classification,
context-tree coding.

A. Introduction and Summary of Results



Traditionally, the analysis of information processing systems is based on a certain modelling
of the process that generates the observed data (e.g., an ergodic process). Based on this
a-priori model, a processor (e.g., a compression algorithm, a classifier, etc.) is then optimally
designed. In practice, there are many cases where insufficient a-priori information about
this generating model is available and one must base the design of the processor on the
observed data only, under some complexity constraints that the processor must comply
with.



1. Universal Data Compression with Limited Memory 

The Kolmogorov-Chaitin complexity (1968) is the length of the shortest program that can 
generate the given individual sequence via a universal Turing machine. More concrete results 
are achieved by replacing the universal Turing machine model with the more restricted 
finite-state machine model. 

The Finite-State (FS) normalised complexity (compression) H(X), measured in bits per
input letter, of an individual infinite sequence X is the normalised length of the shortest 
one-to-one mapping of X into a binary sequence that can be achieved by any finite-state 
compression device [1]. For example, the counting sequence 0123456... when mapped into 
the binary sequence 0,1,00,01,10,11,000,001,010,011,100,101,110,111... is incompressible by 
any finite-state algorithm. Fortunately, the data that one has to deal with is in many cases 
compressible. 
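The mapping used in this example can be generated mechanically. The following is a minimal sketch, assuming the binary words are listed in order of increasing length and, within each length, in lexicographic order (which is what the listed prefix 0,1,00,01,... suggests); the function names are illustrative only.

```python
def binary_word(k: int) -> str:
    """Return the k-th word (k = 0, 1, 2, ...) of the list 0, 1, 00, 01, 10, 11, 000, ..."""
    # k + 2 written in binary is a leading '1' followed by the desired word,
    # so dropping the '0b1' prefix leaves exactly that word.
    return bin(k + 2)[3:]


def mapped_prefix(count: int) -> str:
    """Return the first `count` words of the mapped sequence, comma separated."""
    return ",".join(binary_word(k) for k in range(count))


print(mapped_prefix(14))
```

The first 14 words reproduce the prefix quoted in the text. Every binary word of every length eventually appears, which gives some intuition for why no finite-state device can exploit regularities in this sequence.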

The FS complexity was shown to be asymptotically achieved by applying the LZ universal
data compression algorithm [1] to consecutive blocks of the individual sequence. The FS
modelling approach was also applied to yield asymptotically optimal universal prediction
of individual sequences [9].

Consider now the special case where the FS class of processors is further constrained to include
only block-encoders that process one N-string at a time and then start all over again (e.g.,
due to bounded latency and error-propagation considerations). H(X) is still asymptotically
achievable by the LZ algorithm when applied on-line to consecutive strings of length N, as
N tends to infinity [1]. But the LZ algorithm may not be the best on-line universal data
compression algorithm when the block-length N is finite.

It has been demonstrated that if it is a-priori known that X is a realization of a stationary
ergodic Variable Length Markov Chain (VLMC) that is governed by a tree model, then
context-tree coding yields a smaller redundancy than the LZ algorithm ([10], [4]). More
recently, it has been demonstrated that context-tree coding yields an optimal universal
coding policy (relative to the VLMC assumption) ([2]).

Inspired by these results, one may ask whether the optimality of context-tree coding relative
to tree models still holds for more general setups.

It is demonstrated here that the best possible compression that may be achieved by any
universal data compression algorithm for finite N-blocks is essentially achieved by
context-tree coding for any individual sequence X and not just for individual sequences that are
realizations of a VLMC.

In the following, a number of quantities are defined that are characterised by non-traditional
notations, which seem unavoidable due to end-effects resulting from the finite length of $X_1^N$.
These end-effects vanish as N tends to infinity, but must be taken into account here.

Refer to an arbitrary sequence over a finite alphabet $\mathbf{A}$, $|\mathbf{A}| = A$,
$X^N = X_1^N = X_1, X_2, \ldots, X_N$; $X_n \in \mathbf{A}$, as being an individual sequence. Let
$X = X_1, X_2, \ldots$ denote a semi-infinite sequence over the alphabet $\mathbf{A}$. Next, an empirical
probability distribution $P_{MN}(Z_1^N, N)$ is defined for N-vectors that appear in a sequence of
length MN. The reason for using the notation $P_{MN}(Z_1^N, N)$ rather than, say, $P_{MN}(Z_1^N)$
is due to end-effects, as discussed below. We then define an empirical entropy that results from
$P_{MN}(Z_1^N, N)$, namely $H_{MN}(N)$. This quantity is similar to the classical definition of the
empirical entropy of N-blocks in an individual sequence of length MN and, as one should
anticipate, serves as a lower bound for the compression that can be achieved by any N-block
encoder.

Furthermore, $H_{MN}(N)$ is achievable in the impractical case where one is allowed to first scan
the long sequence $X_1^{MN}$, generate the corresponding empirical probability $P_{MN}(Z_1^N, N)$ for
each N-vector that appears in $X_1^{MN}$, and apply the corresponding Huffman coding to
consecutive N-blocks.
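The two-pass scheme described above can be illustrated with a small sketch. It is a simplification under stated assumptions: only the aligned starting phase is scanned (the empirical probability in the text averages over all phases), the sequence is given as a Python string, and all function names are hypothetical. It compares the per-letter Huffman cost of the N-blocks with their per-letter empirical entropy.

```python
import heapq
import math
from collections import Counter


def block_counts(x: str, n: int) -> Counter:
    """Count the consecutive, non-overlapping n-blocks of x (aligned phase only)."""
    return Counter(x[k:k + n] for k in range(0, len(x) - n + 1, n))


def huffman_lengths(counts: Counter) -> dict:
    """Return the Huffman codeword length (in bits) of every block in counts."""
    if len(counts) == 1:
        return {block: 1 for block in counts}
    # Heap items: (weight, unique tie-breaker, blocks contained in this subtree).
    heap = [(w, idx, [block]) for idx, (block, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = {block: 0 for block in counts}
    tie = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for block in s1 + s2:
            lengths[block] += 1  # each merge adds one bit to every merged block
        heapq.heappush(heap, (w1 + w2, tie, s1 + s2))
        tie += 1
    return lengths


def bits_per_letter(x: str, n: int):
    """Return (Huffman cost, empirical entropy) of the n-blocks, both per letter."""
    counts = block_counts(x, n)
    total = sum(counts.values())
    lengths = huffman_lengths(counts)
    huffman = sum(counts[b] * lengths[b] for b in counts) / (total * n)
    entropy = -sum(c / total * math.log2(c / total) for c in counts.values()) / n
    return huffman, entropy


example = "0001" * 50 + "0110" * 30 + "1111" * 20
print(bits_per_letter(example, 4))
```

For a Huffman code the per-letter cost lies between the per-letter empirical entropy and that entropy plus 1/N, so the gap closes as the block length grows. The cost of describing the code table to the decoder is ignored here, which is one reason the scheme is called impractical.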

Then, define $H(X, N) = \limsup_{M \to \infty} H_{MN}(N)$. It follows that

$$H(X) = \limsup_{N \to \infty} H(X, N)$$

is the smallest number of bits per letter that can be asymptotically achieved by any N-block
data-compression scheme for X. However, in practice, universal data-compression is
executed on-line and the only available information on X is the currently processed N-block.

Next, for $\ell < N$, an empirical probability measure $P_{MN}(Z_1^\ell, N)$ is defined for $\ell$-vectors
that appear in an MN-sequence, which is derived from $P_{MN}(Z_1^N, N)$ by summing up
$P_{MN}(Z_1^N, N)$ over the last $N - \ell$ letters of N-vectors in the MN-sequence. Again, observe
that due to end-effects, $P_{MN}(Z_1^\ell, N)$ is different from $P_{MN}(Z_1^\ell, \ell)$ but converges to it
asymptotically, as M tends to infinity. Similarly, an empirical entropy $H_{MN}(\ell, N)$ is derived from
$P_{MN}(Z_1^\ell, N)$. In the analysis that follows below both $H_{MN}(N)$ and $H_{MN}(\ell, N)$ play an
important role.

An empirical entropy $H_{MN}(\ell | z_1^{\ell-1}, N)$ is associated with each vector $z_1^{\ell-1}$; $1 \le \ell \le (\log N)^2$
in $X_1^{MN}$. This entropy is derived from

$$P_{MN}(z_\ell | z_1^{\ell-1}, N) = \frac{P_{MN}(z_1^\ell, N)}{P_{MN}(z_1^{\ell-1}, N)}$$

Note that this empirical entropy is conditioned on the particular value of $z_1^{\ell-1}$ and is not
averaged over all $z_1^{\ell-1} \in \mathbf{A}^{\ell-1}$ relative to $P_{MN}(z_1^{\ell-1}, N)$.

A context-tree with approximately N leaves, that consists of the N most empirically
probable contexts in $X_1^{MN}$, is generated. For each leaf of this tree, choose the one context
among the contexts along the path that leads from the root of the tree to this leaf, for
which the associated entropy is the smallest. Then, these minimal associated entropies are
averaged over the set of leaves of the tree. This average entropy is denoted by $H_u(N, M)$.
Note that $H_u(N, M)$ is essentially an empirical conditional entropy which is derived for a
suitably derived variable-length Markov chain (VLMC).

Finally, define $H_u(X, N) = \limsup_{M \to \infty} H_u(N, M)$. It is demonstrated that
$\liminf_{N \to \infty} [H(X, N) - H_u(X, N)] \ge 0$. Thus, for large enough N, $H_u(X, N)$, like $H(X, N)$,
may also serve as a lower bound on the compression that may be achieved by any encoder
for N-sequences. The relevance of $H_u(X, N)$ becomes apparent when it is demonstrated
in Theorem 2 below that a context-tree universal data-compression scheme, when applied
to N'-blocks, essentially achieves $H_u(X, N)$ for any X if log N' is only slightly larger than
log N, and achieves H(X) as N' tends to infinity.

Furthermore, it is shown in Theorem 1 below that among the many compressible sequences
X for which $H_u(X, N) = H(X)$; $H(X) < \log A$, there are some for which no on-line universal
data-compression algorithm can achieve any compression at all when applied to consecutive
blocks of length N', if log N' is slightly smaller than log N. Context-tree universal
data-compression is therefore essentially optimal. Note that the threshold effect that is
described above is expressed in a logarithmic scaling of N. At the same time, the logarithmic
scaling of N is apparently the natural scaling for the length of contexts in a context-tree
with N leaves.



2. Application to Universal Classification 

A device called a classifier (or discriminator) observes an individual training sequence of
m letters, $X_1^m$.

The classifier's task is to consider individual test sequences of length N and decide whether
the test N-sequence has, in some sense, the same properties as those that are captured by
the training sequence, or is sufficiently different, according to some appropriate criterion.
No a-priori information about the test sequences is available to the classifier aside from
the training sequence.

A universal classifier $d(X_1^m, \mathbf{A}^N)$ for N-vectors is defined to be a mapping from $\mathbf{A}^N$ onto
{0, 1}. Upon observing $Z_1^N$, the classifier declares $Z_1^N$ to be similar to one of the N-vectors
$X_{j+1}^{j+N}$; $j = 0, 1, \ldots, m - N$ iff $d(X_1^m, Z_1^N \in \mathbf{A}^N) = 1$ (or, in some applications, if a slightly
distorted version $\tilde{Z}_1^N$ of $Z_1^N$ satisfies $d(X_1^m, \tilde{Z}_1^N) = 1$).

In the classical case, the probability distribution of N-sequences is known and an optimal
classifier accepts all N-sequences $Z_1^N$ such that the probability $P(Z_1^N)$ is bigger than
some preset threshold. If $X_1^m$ is a realization of a stationary ergodic source one has, by
the Asymptotic Equipartition Property (A.E.P.) of information theory, that the classifier's
task is tantamount (almost surely for large enough m and N) to deciding whether the test
sequence is equal to a "typical" sequence of the source (or, when some distortion is
acceptable, if a slightly distorted version of the test sequence is equal to a "typical" sequence).
The cardinality of the set of typical sequences is, for large enough m and N, about $2^{NH}$,
where H is the entropy rate of the source [10].

What should one do when $P(Z_1^N)$ is unknown or does not exist, and the only available information
about the generating source is a training sequence $X_1^m$? The case where the training
sequence is a realization of an ergodic source with vanishing memory is studied in [11],
where it is demonstrated that a certain universal context-tree based classifier is essentially
optimal for this class of sources (the fact that context-trees are versatile for classes of
renewal sources was recently demonstrated in [20]). This is in unison with related results
on universal prediction ([11], [12], [13], [18], [19]).

Universal classification of test sequences relative to a long training sequence is a central 
problem in computational biology. One common approach is to assume that the training 
sequence is a realization of a VLMC, and upon viewing the training sequence, to construct 
an empirical Probabilistic Suffix Tree (PST), the size of which is limited by the available 
storage complexity of the classifier, and apply a context-tree based classification algorithm 
[7], [8]. 

But how should one proceed if there is no a-priori support for the VLMC assumption? In
the following, it is demonstrated that the PST approach is essentially optimal for every
individual training sequence, even without the VLMC assumption. Denote by $S_d(N, X_1^m, \epsilon)$
a set of N-sequences $Z_1^N$ which are declared to be similar to $X_1^m$ (i.e. $d(X_1^m, Z_1^N) = 1$),
where $d(X_1^m, X_{j+1}^{j+N}) = 0$ should be satisfied by no more than $\epsilon(m - N + 1)$ instances
$j = 0, 1, 2, \ldots, m - N$, and where $\epsilon$ is an arbitrarily small positive number. Also, given a
particular classifier, let $D_d(N, X_1^m, \epsilon) = |S_d(N, X_1^m, \epsilon)|$ and let

$$H_d(N, X_1^m, \epsilon) = \frac{1}{N} \log D_d(N, X_1^m, \epsilon)$$

Thus, any classifier is characterised by a certain $H_d(N, X_1^m, \epsilon)$. Given $X_1^m$, let $D_{d,min}(N, X_1^m, \epsilon)$
be the smallest achievable $D_d(N, X_1^m, \epsilon)$. Denote by $d^*$ the particular classifier that achieves
$D_{d,min}(N, X_1^m, \epsilon)$ and let

$$H_{min}(N, X_1^m, \epsilon) = \frac{1}{N} \log D_{d^*}(N, X_1^m, \epsilon)$$

For an infinite training sequence X, let

$$H_{min}(N, X, \epsilon) = \limsup_{m \to \infty} H_{min}(N, X_1^m, \epsilon)$$

Note that

$$H(X) = \lim_{\epsilon \to 0} \limsup_{N \to \infty} H_{min}(N, X, \epsilon)$$

is the topological entropy of X [6].

Naturally, if the classifier has the complete list of N-vectors that achieve $D_{d,min}(N, X_1^m, \epsilon)$,
it can achieve a perfect classification by making $d(X_1^m, Z_1^N) = 1$ iff $Z_1^N = X_{j+1}^{j+N}$, for every
instance $j = 0, 1, 2, \ldots, m - N$ for which $X_{j+1}^{j+N}$ is in this complete list.

The discussion is constrained to cases where $H_{min}(N, X_1^m, \epsilon) > 0$. Therefore, when m
is large, $D_{d,min}(N, X_1^m, \epsilon)$ grows exponentially with N (e.g. when the test sequence is a
realization of an ergodic source with a positive entropy rate). The attention is limited to
classifiers that have a storage-space complexity that grows only linearly with N. Thus, the
long training sequence cannot be stored. Rather, the classifier is constrained to represent
the long training sequence by a short "signature", and use this short signature to classify
incoming test sequences of length N. It is shown that it is possible to find such a classifier,
denoted by $\hat{d}(X_1^m, \epsilon, Z_1^N)$, which is essentially optimal in the following sense:

An optimal "ε-efficient" universal classifier $\hat{d}(X_1^m, \epsilon, Z_1^N)$ is defined to be one that satisfies
the condition that $\hat{d}(X_1^m, X_{j+1}^{j+N}) = 1$ for $(1 - \hat{\epsilon})(m - N + 1)$ instances $j = 0, 1, \ldots, m - N$,
where $\hat{\epsilon} \le \epsilon$. This corresponds to the rejection of at most $\epsilon D_{min}(N, X_1^m, \epsilon)$ vectors from
among the $D_{min}(N, X_1^m, \epsilon)$ typical N-vectors in $X_1^m$. Also, an optimal "ε-efficient" universal
classifier should satisfy the condition that $\hat{d}(X_1^m, Z_1^N) = 1$ is satisfied by no more than
$2^{N[H_{min}(N, X_1^m, \epsilon) + \epsilon]}$ N-vectors. This corresponds to a false-alarm rate of

$$\frac{2^{N[H_{min}(N, X_1^m, \epsilon) + \epsilon]} - 2^{N H_{min}(N, X_1^m, \epsilon)}}{2^{N \log A} - 2^{N H_{min}(N, X_1^m, \epsilon)}}$$

when N-vectors are selected independently at random, with an induced uniform probability
distribution over the set of $2^{N \log A} - 2^{N H_{min}(N, X_1^m, \epsilon)}$ N-vectors that should be rejected.
Note that the false-alarm rate is thus guaranteed to decrease exponentially with N for any
individual sequence $X_1^m$ for which $H_{min}(N, X_1^m, \epsilon) < \log A - \epsilon$.
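The exponential decay of the false-alarm rate can be checked numerically. The sketch below is illustrative only: it assumes base-2 logarithms, reads the acceptance bound as 2 raised to N times (H_min + ε), treats H_min as a fixed number that does not change with N, and uses hypothetical parameter names.

```python
import math


def false_alarm_rate(n: int, h_min: float, eps: float, alphabet_size: int) -> float:
    """Fraction of the N-vectors that should be rejected but may still be accepted."""
    log_a = math.log2(alphabet_size)
    # Accepted vectors beyond the typical ones: 2^{N(H_min + eps)} - 2^{N H_min}.
    wrongly_accepted = 2.0 ** (n * (h_min + eps)) - 2.0 ** (n * h_min)
    # Vectors that should be rejected: 2^{N log A} - 2^{N H_min}.
    should_be_rejected = 2.0 ** (n * log_a) - 2.0 ** (n * h_min)
    return wrongly_accepted / should_be_rejected


for block_length in (10, 20, 40):
    print(block_length, false_alarm_rate(block_length, h_min=1.0, eps=0.1, alphabet_size=4))
```

With a four-letter alphabet (log A = 2), H_min = 1 and ε = 0.1, the rate behaves roughly like 2 raised to the power -N(log A - H_min - ε), so it shrinks quickly as N grows.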

A context-tree based classifier for N-sequences, given an infinite training sequence X and a
storage-complexity of O(N), is shown by Theorem 3 below to be essentially ε-efficient (and
therefore essentially optimal) for any $N > N_0(X)$ and some $m = m_0(N, X)$. Furthermore,
by Theorem 3 below, among the set of training sequences for which the proposed classifier
is essentially optimal, there are some for which no ε-efficient classifier for N'-sequences
exists, if log N' < log N for any $\epsilon < \log A - H_{min}(N, X, \epsilon)$. Thus, the proposed classifier is
essentially optimal.

Finally, the following universal classification problem is considered: Given two test-sequences
$Y_1^N$ and $Z_1^N$ and no training data, are these two test-sequences "similar" to each other?
The case where both $Y_1^N$ and $Z_1^N$ are realizations of some (unknown) finite-order Markov
processes is discussed in [14], where an asymptotically optimal empirical divergence measure
is derived empirically from $Y_1^N$ and $Z_1^N$.

In the context of the individual-sequence approach that is adopted here, this amounts to the
following problem: Given $Y_1^N$ and $Z_1^N$, is there a training-sequence X for which $H_{\epsilon,min}(X) > 0$,
such that both $Y_1^N$ and $Z_1^N$ are accepted by some ε-efficient universal classifier with linear
space complexity? (This problem is reminiscent of the "common ancestor" problem in
computational biology [15], where one may think of X as a training sequence that captures
the properties of a possible "common ancestor" of two DNA sequences $Y_1^N$ and $Z_1^N$.)

This is the topic of the Corollary following Theorem 3 below. 



B. Definitions, Theorems and Algorithms 



Given $X_1^N \in \mathbf{A}^N$, let $c(X_1^N)$; $X_1^N \in \mathbf{A}^N$ be a one-to-one mapping of $X_1^N$ into a binary
sequence of length $L(X_1^N)$, which is called the length function of $c(X_1^N)$. It is assumed that
$L(X_1^N)$ satisfies the Kraft inequality.
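As a reminder, a set of binary codeword lengths satisfies the Kraft inequality when the sum of 2 raised to minus each length does not exceed 1. The sketch below checks this for a list of lengths using exact rational arithmetic; the helper names and the example lengths are illustrative and not taken from the paper.

```python
from fractions import Fraction


def kraft_sum(lengths) -> Fraction:
    """Return the exact value of the sum over all codewords of 2^(-length)."""
    return sum((Fraction(1, 2 ** length) for length in lengths), Fraction(0))


def satisfies_kraft(lengths) -> bool:
    """True if a uniquely decodable binary code with these lengths can exist."""
    return kraft_sum(lengths) <= 1


# Lengths assigned to the four 2-blocks 00, 01, 10, 11 over a binary alphabet.
print(kraft_sum([1, 2, 3, 3]), satisfies_kraft([1, 2, 3, 3]))
# Two codewords of length 1 leave no room for a third codeword.
print(kraft_sum([1, 1, 2]), satisfies_kraft([1, 1, 2]))
```

Exact fractions avoid the rounding issues that floating-point sums could introduce when many long codewords are involved.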

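For every X and any positive integers M, N, define the compression of the prefix $X_1^{MN}$ to
be:

$$\rho_L(X, N, M) = \max_{i;\, 1 \le i \le N-1} \frac{1}{NM} \left[ \sum_{j=0}^{M-2} L\left(X_{i+jN}^{i+(j+1)N-1}\right) + L\left(X_1^{i-1}\right) + L\left(X_{i+(M-1)N}^{MN}\right) \right]$$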



Thus, one looks for the compression of the sequence $X_1^{MN}$ that is achieved by successively
applying a given length-function $L(X_1^N)$ and with the worst starting
phase $i$; $i = 1, 2, \ldots, N - 1$. It should be noted that for any given length-function $L(X_1^N)$,
one can construct another length-function for $N^2$-vectors, $L(X_1^{N^2})$, such that, when applied
successively to the vectors $X_{jN^2+1}^{(j+1)N^2}$; $j = 0, 1, 2, \ldots, M - 1$, the resulting compression will be
no larger than the compression that is achieved by applying $L(X_{i+jN}^{i+(j+1)N-1})$ to successive
vectors $X_{i+jN}^{i+(j+1)N-1}$, up to a factor $O(\frac{1}{N})$, where $j = 1, 2, \ldots, MN^2$ and any $1 \le i \le N - 1$.
Thus, asymptotically as M and N tend to infinity, the starting phase i has a diminishing
effect on the best achievable compression. Observe that by ignoring the terms $\frac{1}{NM} L(X_1^{i-1})$
and $\frac{1}{NM} L(X_{i+(M-1)N}^{MN})$ (that vanish for large values of M) in the expression above, one
gets a lower bound on the actual compression.

In the following, a lower-bound $H_{MN}(N)$ on $\rho_L(X, N, M)$ is derived, that applies to any
length-function $L(X_1^N)$. First, the notion of empirical probability of an N-vector in a finite
MN-vector is derived, for any two positive integers N and M.

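Given a positive integer $\ell$; $1 \le \ell \le N$ and a sequence $X_1^{MN}$, define for a vector $Z_1^\ell \in \mathbf{A}^\ell$,

$$P_{MN,i}(Z_1^\ell, N) = \frac{1}{M-1} \sum_{j=0}^{M-2} 1_{Z_1^\ell}\left(X_{i+jN}^{i+jN+\ell-1}\right); \quad 1 \le i \le N \qquad (1)$$

and

$$P_{MN}(Z_1^\ell, N) = \frac{1}{N} \sum_{i=1}^{N} P_{MN,i}(Z_1^\ell, N) \qquad (2)$$

where $1_{Z_1^\ell}(X_{i+jN}^{i+jN+\ell-1}) = 1$ iff $X_{i+jN}^{i+jN+\ell-1} = Z_1^\ell$; otherwise, $1_{Z_1^\ell}(X_{i+jN}^{i+jN+\ell-1}) = 0$.

Thus,

$$P_{MN}(Z_1^\ell, \ell) = \frac{1}{(M-1)N+1} \sum_{j=1}^{(M-1)N+1} 1_{Z_1^\ell}\left(X_j^{j+\ell-1}\right)$$

is the standard definition of the empirical probability of $Z_1^\ell$.

(As noted in the Introduction, $P_{MN}(Z_1^\ell, N)$ converges to the empirical probability $P_{MN}(Z_1^\ell, \ell)$
as M tends to infinity. However, for finite values of M, these two quantities are not identical
due to end-effects.)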

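Let

$$H_{MN}(\ell, N) = -\frac{1}{\ell} \sum_{Z_1^\ell \in \mathbf{A}^\ell} P_{MN}(Z_1^\ell, N) \log P_{MN}(Z_1^\ell, N)$$

and

$$H_{MN}(N) = H_{MN}(N, N) = -\frac{1}{N} \sum_{Z_1^N \in \mathbf{A}^N} P_{MN}(Z_1^N, N) \log P_{MN}(Z_1^N, N) \qquad (3)$$

then,

Proposition 1

$$\rho_L(X, N, M) \ge H_{MN}(N)$$

and

$$\limsup_{M \to \infty} \rho_L(X, N, M) \ge H(X, N)$$

where $H(X, N) = \limsup_{M \to \infty} H_{MN}(N)$. The proof appears in the Appendix.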
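The empirical quantities that enter Proposition 1 can be computed directly for short sequences. The following sketch assumes that phase i runs over 1, ..., N, that windows start at positions i + jN for j = 0, ..., M-2, and that logarithms are taken to base 2; the sequence is a Python string and the function names are illustrative.

```python
import math
from collections import Counter


def empirical_probability(x: str, n: int, ell: int) -> dict:
    """Phase-averaged empirical probability of the ell-vectors in an MN-sequence."""
    m = len(x) // n
    if m < 2 or not 1 <= ell <= n:
        raise ValueError("need at least two N-blocks and 1 <= ell <= N")
    counts = Counter()
    for i in range(1, n + 1):            # starting phase i (1-indexed as in the text)
        for j in range(m - 1):           # j = 0, ..., M-2
            start = i + j * n - 1        # 0-indexed position of letter X_{i+jN}
            counts[x[start:start + ell]] += 1
    total = n * (m - 1)                  # number of windows that were examined
    return {z: c / total for z, c in counts.items()}


def empirical_entropy(x: str, n: int, ell: int) -> float:
    """Per-letter empirical entropy of ell-vectors, in bits."""
    p = empirical_probability(x, n, ell)
    return -sum(q * math.log2(q) for q in p.values()) / ell


periodic = "01" * 8
print(empirical_probability(periodic, 4, 4))
print(empirical_entropy(periodic, 4, 4))
```

For the periodic sequence 0101... with N = 4, only the two 4-vectors 0101 and 1010 occur, each with empirical probability 1/2, so the per-letter entropy is 1/4 bit. This illustrates why averaging over phases matters: a single aligned phase would see only one of the two vectors.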

Thus, the best possible compression of $X_1^{MN}$ that may be achieved by any one-to-one
encoder for N-blocks is bounded from below by $H_{MN}(N)$. Furthermore, $H_{MN}(N)$ is
achievable for N that is much smaller than log M, if $c(Z_1^N)$ (and its corresponding
length-function $L(Z_1^N)$) is tailored to the individual sequence $X_1^{MN}$, by first scanning $X_1^{MN}$,
evaluating the empirical distribution $P_{MN}(Z_1^N, N)$ of N-vectors and then applying the
corresponding Huffman data compression algorithm. However, in practice, the data-compression
has to be executed on-line and the only available information on $X_1^{MN}$ is the one that is
contained in the currently processed N-block.

The main topic of this paper is to find out how well one can do in such a case, where the
same mapping $c(X_1^N)$ is being applied to successive N-vectors of $X_1^{MN}$.

Next, given N and M, a particular context-tree is generated from $X_1^{MN}$ for each letter
$X_i$; $1 \le i \le MN$, and a related conditional empirical entropy $H_u(N, M)$ is defined.
It is then demonstrated that for large enough N and M, $H_u(N, M)$ may also serve as a
lower bound on $\rho_L(X, N, M)$.

Construction of the Context-tree for the letter $Z_i$

1) Consider contexts which are not longer than $t = \lceil (\log N)^2 \rceil$ and let K be a positive
number.

2) Let $K_1(Z_1^N, K) = \min[j - 1, t]$ where j is the smallest positive integer such that
$P_{MN}(Z_1^j, N) \le \frac{1}{K}$, where the probability measure $P_{MN}(Z_1^j, N)$ for vectors $Z_1^j \in \mathbf{A}^j$ is
derived from $X_1^{MN}$. If such j does not exist, set $K_1(Z_1^N, K) = -1$, where $Z_1^0$ is the null
vector.

3) Given $X_1^{MN}$, evaluate $P_{MN}(Z_1^i, N)$. For the i-th instance in $Z_1^N$, let $Z_1^{i-1}$ be the
corresponding suffix. For each particular $z_1^{i-1} \in \mathbf{A}^{i-1}$ define

$$H_{MN}(i | z_1^{i-1}, N) = -\sum_{z_i \in \mathbf{A}} P_{MN}(z_i | z_1^{i-1}, N) \log P_{MN}(z_i | z_1^{i-1}, N) \qquad (4)$$

4) Let $j_0 = j_0(Z_1^{i-1})$ be the integer for which

$$H_{MN}(i | z_{i-j_0}^{i-1}, N) = \min_{1 \le j \le 1 + K_1(Z_1^N, K)} H_{MN}(i | z_{i-j}^{i-1}, N)$$

Each such $j_0$ is a node in a tree with about K leaves. The set of all such nodes represents
the particular context tree for the i-th instance.

5) Finally,

$$H_u(N, K, M) = \sum_{z_1^{i-1} \in \mathbf{A}^{i-1}} P_{MN}(z_1^{i-1}, N) H_{MN}(i | z_{i-j_0}^{i-1}, N) \qquad (5)$$

Observe that $H_u(N, K, M)$ is an entropy-like quantity defined by an optimal data-driven
tree of variable depth $K_1$, where each leaf that is shorter than t has roughly an empirical
probability $\frac{1}{K}$.
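The idea behind this construction (keep only contexts that are frequent enough, then pick for every instance the admissible context with the smallest conditional entropy) can be sketched in a few lines. This is a deliberately simplified illustration and not the construction itself: it uses sliding windows over a single string instead of the phase-averaged measure, admits a context only if its relative frequency is at least 1/K, always admits the empty context, and uses hypothetical function and parameter names.

```python
import math
from collections import Counter, defaultdict


def conditional_entropy(next_counts: Counter) -> float:
    """Entropy (bits) of the next letter, given the counts observed after a context."""
    total = sum(next_counts.values())
    return -sum(c / total * math.log2(c / total) for c in next_counts.values())


def context_tree_entropy(x: str, t: int, k_leaves: float) -> float:
    """Average over positions of the smallest entropy among admissible suffix contexts."""
    positions = range(t, len(x))
    if len(positions) == 0:
        raise ValueError("sequence must be longer than the maximal context depth t")
    # next_letter[c] counts the letters that follow context c, for lengths 0..t.
    next_letter = defaultdict(Counter)
    for p in positions:
        for j in range(t + 1):
            next_letter[x[p - j:p]][x[p]] += 1
    total = len(positions)
    entropy = {c: conditional_entropy(cnt) for c, cnt in next_letter.items()}
    prob = {c: sum(cnt.values()) / total for c, cnt in next_letter.items()}
    acc = 0.0
    for p in positions:
        best = entropy[""]                  # the empty context is always admissible
        for j in range(1, t + 1):
            context = x[p - j:p]
            if prob[context] < 1.0 / k_leaves:
                break                       # too rare: longer contexts are rarer still
            best = min(best, entropy[context])
        acc += best
    return acc / total


print(context_tree_entropy("01" * 50, t=2, k_leaves=4))
print(context_tree_entropy("01" * 50, t=2, k_leaves=1))
```

With K = 4 the one-letter contexts are admissible and predict the alternating sequence perfectly, giving zero; with K = 1 only the empty context survives and the result is one bit per letter. The threshold 1/K therefore controls how much memory the resulting model is allowed to use.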

Set K = N and let $H_u(N, M) = H_u(N, N, M)$. Also, let

$$H_u(X, N) = \limsup_{M \to \infty} H_u(N, M) \qquad (6)$$

and

$$H_u(X) = \limsup_{N \to \infty} H_u(X, N) \qquad (7)$$



The different empirical quantities that are defined above reappear throughout the paper 
and are therefore summarised in the short Glossary below. 

Glossary: 

1) $P_{MN}(z_1^\ell, N)$: An empirical probability measure on $\ell$-vectors $z_1^\ell$ in the vector $Z_1^{MN}$.

2) $P_{MN}(z_1 | z_{-\ell+1}^0, N) = \frac{P_{MN}(z_{-\ell+1}^1, N)}{P_{MN}(z_{-\ell+1}^0, N)}$: A conditional empirical probability measure.

3) $H_{MN}(\ell, N)$: An empirical entropy that is derived from $P_{MN}(z_1^\ell, N)$.

4) $H_{MN}(N) = H_{MN}(N, N)$; $H(X, N) = \limsup_{M \to \infty} H_{MN}(N)$; $H(X) = \limsup_{N \to \infty} H(X, N)$.

5) $H_{MN}(\ell | z_1^{\ell-1}, N)$: A conditional empirical entropy, conditioned on the particular suffix $z_1^{\ell-1}$.

6) $K_1(z_{-N+1}^0, K)$ is the smallest positive integer j such that $P_{MN}(z_{-j+1}^0) \le \frac{1}{K}$.

7) $H_u(N, K, M)$: The average of the minimal value of $H_{MN}(1 | z_{-j+1}^0, N)$; $1 \le j \le 1 + K_1(z_{-N+1}^0, K)$
over $z_{-j+1}^0$; $H_u(N, N, M)$ is denoted by $H_u(N, M)$.

8) $H_u(X, N) = \limsup_{M \to \infty} H_u(N, M)$; $H_u(X) = \limsup_{N \to \infty} H_u(X, N)$.

Let

$$H(X) = \limsup_{N \to \infty} H(X, N)$$

where, as defined above, $H(X, N) = \limsup_{M \to \infty} H_{MN}(N)$. Then,

Lemma 1 For every individual sequence X,

$$\liminf_{N \to \infty} \left[ H(X, N) - H_u(X, N) \right] \ge 0 \qquad (8)$$

Hence, $H_u(X) \le H(X)$. The proof of Lemma 1 appears in the Appendix.



Any compression algorithm that achieves a compression $H_u(X, N)$ is therefore
asymptotically optimal as N tends to infinity.

Note that the conditional entropy for a data-driven Markov tree of a uniform depth of, say,
O(log log N) may still satisfy Lemma 1 (by the proof of Lemma 1), but the corresponding
conditional empirical entropy is lower-bounded by $H_u(X, N)$ for finite values of N.

A context-tree data-compression algorithm for N-blocks that essentially achieves a
compression $H_u(X, N)$ is introduced below, and is therefore asymptotically optimal, but
so are other universal data-compression algorithms (e.g., as mentioned above, a simpler
context-tree data compression algorithm with a uniform context depth or the
LZ algorithm [1]). However, the particular context-tree algorithm that is proposed below
is shown to be essentially optimal for non-asymptotic values of N as well.

Let $\delta$ be an arbitrarily small positive number and let us consider the class $C_{N_0, M_0, \delta}$ of all
X-sequences for which, for some H such that $\delta < H < (1 - 2\delta) \log A$,

1) $H_{M_0 N_0}(N_0, N_0) = H$.

2) $H_u(N_0, K_0, M_0) - H < \delta$ where $K_0 = N_0$.

Theorem 1 A) The class $C_{N_0, M_0, \delta}$ is not empty. Every sequence X is in the set $C_{N_0, M_0, \delta}$
for large enough $N_0$ and $M_0 = M_0(N_0)$.

B) Let $N' = N^{1-\delta}$. For any universal data-compression algorithm for N'-vectors that utilises
some length-function $L(Z_1^{N'})$, there exist some sequences $X \in C_{N_0, M_0, \delta}$ such that for any
$M \ge M_0$ and any $N' \ge N_0^{1-\delta}$:

$$\rho_L(X, N', M) \ge (1 - \delta)[\log A - \delta] > H$$

for large enough $N_0$.

The proof of Theorem 1 appears in the Appendix. 

The next step demonstrates that there exists a universal data-compression algorithm, which
is optimal in the sense that when it is applied to consecutive N-blocks, its associated
compression is about $H_u(X, N')$ for every individual sequence X where log N' is slightly
smaller than log N.

Theorem 2 Let $\delta$ be an arbitrarily small positive number and let $N' = \lceil N^{1-\delta} \rceil$. There
exists a context-tree universal coding algorithm for N-blocks, with a length-function $L(Z_1^N)$
for which, for every individual $X_1^{MN} \in \mathbf{A}^{MN}$,

$$\rho_L(X, N, M) \le H_u(N, N', M) + O\left((\log N)^2 N^{-\delta}\right)$$

It should be noted here that it is not claimed that the particular algorithm that is described
below yields the smallest possible computational complexity. It suffices to establish the
fact that this particular essentially optimal universal algorithm indeed belongs to the class
of context-tree algorithms. The reader is referred to [2] and [4] for an exhaustive discussion
of optimal universal context-tree algorithms where the issues of minimal redundancy and
computational complexity are discussed at length. Therefore, no attempt was made to
minimise or evaluate the computational complexity of this particular "test-bench" algorithm.
The practitioner reader is referred to [2] where it was established that an optimal universal
context coding may be realized by an algorithm with a computational complexity that grows
only linearly with the block-length N. For an enlightening perspective on computationally
bounded data-compression algorithms (which are not necessarily "essentially optimal" in
the sense of Theorem 2) see [21].

Description of the universal compression algorithm: Consider first the encoding
of the first N-vector $X_1^N$ (to be repeated for every $X_{(i-1)N+1}^{iN}$; $i = 2, 3, \ldots, M$). Let
$t = \lceil (\log N)^2 \rceil$ and $M' = \frac{N}{t}$ (assuming that t divides N).

A) Generate the set $T'(X_1^N)$ that consists of all contexts $x_1^{i-1}$ that appear in $X_1^N$,
satisfying $P_N(x_1^{i-1}, t) \ge \frac{1}{N'}$; $i \le t$, where $N' = N^{1-\delta}$ ([16], [17]).

Clearly, $T'(X_1^N)$ is a context tree with no more than $N^{1-\delta}$ leaves with a maximum
depth of $t = \lceil (\log N)^2 \rceil$. The depth t is chosen to be just small enough so as to yield
an implementable compression scheme and at the same time, still be big enough so as
to yield an efficient enough compression.

B) Evaluate

$$H_u(t, N', M') = \sum_{x_{-t+1}^{-1} \in \mathbf{A}^{t-1}} P_N(x_{-t+1}^{-1}, t) \min_{0 \le j \le t-1;\; x_{-j}^{-1} \in T'(X_1^N)} H_N(x_0 | x_{-j}^{-1}, t)$$

Note that $H_u(t, K, M')$; $K = N'$ is a conditional empirical entropy that is derived
from an empirical probability measure of t-vectors in the particular N-vector $X_1^N$,
while previously $H_u(N, K, M)$ has been derived from an empirical probability measure
of N-vectors in the whole individual sequence $X_1^{MN}$ (see the Glossary at the end of
Section B). Also note that the number of computational steps that are involved in the
minimisation is $t |T'(X_1^N)| \le O(N)$.

Let $T_u(X_1^N)$ be a sub-tree of $T'(X_1^N)$, such that its leaves are the contexts that achieve
$H_u(t, N', M')$.

C) A length function $L(X_1^N) = L_1(X_1^N) + L_2(X_1^N) + L_3(X_1^N)$ is constructed as follows:

1) $L_1(X_1^N)$ is the length of an uncompressed binary length-function, $m_1(X_1^N)$, that
enables the decoder to reconstruct the context tree $T_u(X_1^N)$, that consists of the
set of contexts that achieves $H_u(t, N', M')$. This tree has, by construction, at
most $N^{1-\delta}$ leaves and at most t letters per leaf. It takes at most $1 + t \log A$ bits
to encode a vector of length t over an alphabet of A letters. It also takes at most
$1 + \log t$ bits to encode the length of a particular context. Therefore,
$L_1(X_1^N) \le N^{1-\delta}(t \log A + \log t + 2) \le [(\log N)^2 \log A + \log((\log N)^2) + 2] N^{1-\delta}$ bits.

2) $L_2(X_1^N)$ is the length of a binary word $m_2(X_1^N)$ ($t \log A$ bits long), which is an
uncompressed binary mapping of $X_1^t$, the first t letters of $X_1^N$.

3) Observe that given $m_1(X_1^N)$ and $m_2(X_1^N)$, the decoder can re-generate $X_1^t$ and
the sub-tree $T_u(X_1^N)$ that achieves $H_u(t, N', M')$. Given $T_u(X_1^N)$ and a prefix $X_1^t$
of $X_1^N$, $X_{t+1}^N$ is compressed by a context-tree algorithm for FSMX sources [3], [4],
which is tailored to the contexts that are the leaves of $T_u(X_1^N)$, yielding a length
function $L_3(X_1^N) \le N H_u(t, N', M') + O(1)$, where the small O(1) redundancy
is achieved by combining arithmetic coding with Krichevsky-Trofimov mixtures
[4].

D) Thus,

$$L(X_1^N) \le N \left[ H_u(t, N', M') + O\left((\log N)^2 N^{-\delta}\right) \right]$$

Repeat the steps 1), 2) and 3) above for the N-vectors $X_{(i-1)N+1}^{iN}$; $i = 2, 3, \ldots, M$ and
denote by $H_{u,i}(t, N', M')$ the quantity $H_u(t, N', M')$ that is derived for the N-vector
$X_{(i-1)N+1}^{iN}$.
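The side-information cost of steps C.1 and C.2 can be evaluated numerically to see how slowly it becomes negligible. The sketch below is an illustration under assumptions: logarithms are taken to base 2, the bound on $L_1$ is used with $t = \lceil (\log N)^2 \rceil$, and the function and parameter names are hypothetical.

```python
import math


def side_information_bits(n: int, delta: float, alphabet_size: int) -> float:
    """Upper bound on L_1 + L_2: tree description plus the first t raw letters."""
    log_a = math.log2(alphabet_size)
    t = math.ceil(math.log2(n) ** 2)                 # maximal context depth
    leaves = n ** (1.0 - delta)                      # at most N^(1 - delta) leaves
    l1 = leaves * (t * log_a + math.log2(t) + 2)     # bound on L_1 from step C.1
    l2 = t * log_a                                   # length of m_2 from step C.2
    return l1 + l2


def overhead_per_letter(n: int, delta: float, alphabet_size: int) -> float:
    """Side-information bits divided by the block length N."""
    return side_information_bits(n, delta, alphabet_size) / n


for exponent in (10, 20, 40):
    print(exponent, overhead_per_letter(2 ** exponent, delta=0.5, alphabet_size=2))
```

For a binary alphabet and δ = 0.5, the overhead bound exceeds one bit per letter at N = 2^10 and only drops below it for much longer blocks, which reflects the fact that the $O((\log N)^2 N^{-\delta})$ term is an asymptotic statement.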

Proof of Theorem 2: Let $N'' = N^{1-2\delta}$ and let $T_u''(X_1^{MN})$ be the subset of contexts for
which the minimisation that yields $H_u(N, N'', M)$ is achieved.

The proof of Theorem 2 follows from the construction and by the convexity of the entropy
function since

$$\frac{1}{M} \sum_{i=0}^{M-1} H_{u,i}(t, N', M') \le H_u(N, N'', M) + O(N^{-\delta}) + O(\cdot)$$

where the term $O(N^{-\delta})$ is an upper-bound on the relative frequency of instances in $X_1^N$
that have as a context a leaf of $T'(X_1^N)$ that is a suffix of a leaf of $T_u''(X_1^{MN})$ and is therefore
not included in the set of contexts that achieve $H_u(N, N'', M)$. The last term, $O(\cdot)$, is due
to end-effects and follows from Lemma 2.7 in [5, page 33] (see proof of Lemma 1 in the
Appendix).

In conclusion, the proposed algorithm is based on an empirical VLMC and is a universal 
context-tree algorithm, but, as mentioned above, not necessarily the best one in terms of 
computational complexity. This issue and others are thoroughly discussed in [4]. 



C. Application to Universal Classification 

A device called a classifier (or discriminator) observes an individual training sequence of
m letters, $X_1^m$.

The classifier's task is to consider individual test sequences of length N and decide whether
the test N-sequence has the same features as those that are captured by the training
sequence, or is sufficiently different, according to some appropriate criterion. No a-priori
information about the test sequences is available to the classifier aside from the training
sequence. Following the discussion in the Introduction section, a universal classifier
$d(X_1^m, \mathbf{A}^N)$ for N-vectors is defined to be a mapping from $\mathbf{A}^N$ onto {0, 1}. Upon observing
$Z_1^N$, the classifier declares $Z_1^N$ to be similar to one of the N-vectors $X_{j+1}^{j+N}$; $j = 0, 1, \ldots, m - N$
iff $d(X_1^m, Z_1^N \in \mathbf{A}^N) = 1$ (or, in some applications, if a slightly distorted version $\tilde{Z}_1^N$ of $Z_1^N$
satisfies $d(X_1^m, \tilde{Z}_1^N) = 1$). Denote by $S_d(N, \epsilon, X_1^m)$ a set of N-sequences $Z_1^N$ which are
declared to be similar to $X_1^m$ (i.e. $d(X_1^m, Z_1^N) = 1$), where $d(X_1^m, X_{j+1}^{j+N}) = 0$ should be satisfied
by no more than $\epsilon(m - N + 1)$ instances $j = 0, 1, 2, \ldots, m - N$, and where $\epsilon$ is an arbitrarily
small positive number. Also, given a particular classifier, let $D_d(N, X_1^m, \epsilon) = |S_d(N, \epsilon, X_1^m)|$,
and let

$$H_d(N, X_1^m, \epsilon) = \frac{1}{N} \log D_d(N, X_1^m, \epsilon).$$

Thus, any classifier is characterised by a certain $H_d(N, X_1^m, \epsilon)$. Given $X_1^m$, let $D_{min}(N, X_1^m, \epsilon)$
be the smallest achievable $D_d(N, X_1^m, \epsilon)$. Denote by $d^*$ the particular classifier that achieves
$D_{min}(N, X_1^m, \epsilon) = D_{d^*}(N, X_1^m, \epsilon)$ and let

$$H_{min}(N, X_1^m, \epsilon) = \frac{1}{N} \log D_{d^*}(N, X_1^m, \epsilon).$$



Naturally, if the classifier has the complete list of $N$-vectors that achieve $D_{\min}(N, X_1^m, \epsilon)$,
it can perform a perfect classification by making $d(X_1^m, Z_1^N) = d^*(X_1^m, Z_1^N) = 1$ iff
$Z_1^N = X_{j+1}^{j+N}$ for every instance $j = 0, 1, 2, \ldots, m - N$ for which
$X_{j+1}^{j+N} \in S_{d^*}(N, \epsilon, X_1^m)$. The discussion is constrained to cases where
$H_{\min}(N, X_1^m, \epsilon) > 0$. Therefore, when $m$ is large, $D_{\min}(N, X_1^m, \epsilon)$ grows exponentially with $N$.
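The quantity $H_{\min}$ can be illustrated with a brute-force computation. The following is a minimal sketch, assuming that the training sequence is a Python string and that the smallest accepted set is obtained by keeping the most frequent $N$-windows until at most a fraction $\epsilon$ of the windows is left out. The function name `h_min` and its return convention are hypothetical and are used here for illustration only; the sketch stores the full list of windows, which is exactly what the linear-memory classifier discussed in this section avoids.

```python
import math
from collections import Counter

def h_min(train, n, eps):
    """Return (H_min, kept): kept is a smallest set of n-windows of `train`
    that leaves out at most eps * (number of windows) window instances."""
    if not 0 <= eps < 1:
        raise ValueError("eps must satisfy 0 <= eps < 1")
    windows = [train[j:j + n] for j in range(len(train) - n + 1)]
    counts = Counter(windows)
    total = len(windows)
    kept, covered = [], 0
    # Keep the most frequent windows first; stop once the rejected mass is small enough.
    for vec, c in counts.most_common():
        if total - covered <= eps * total:
            break
        kept.append(vec)
        covered += c
    return math.log2(len(kept)) / n, kept
```

For the training string `"abababab"` and `n = 2` there are seven windows (four `ab`, three `ba`); with `eps = 0` both must be kept, while with `eps = 0.5` the three `ba` windows may be rejected.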

The attention is limited to classifiers that have a storage-space complexity that grows only
linearly with $N$. Thus the training sequence cannot be stored within this limited memory.
Rather, the classifier should represent the long training sequence with a short "signature"
and use it to classify incoming test sequences of length $N$. It is shown that it is possible to
find such a classifier, denoted by $\hat{d}(X_1^m, \epsilon, Z_1^N \in \mathbf{A}^N)$, that is essentially optimal in the
following sense (as discussed and motivated in the Introduction section):
$\hat{d}(X_1^m, \epsilon, Z_1^N \in \mathbf{A}^N)$ is defined to be one that satisfies the condition that
$\hat{d}(X_1^m, X_{j+1}^{j+N}) = 1$ for $(1 - \hat{\epsilon})(m - N + 1)$ instances $j = 0, 1, \ldots, m - N$, where
$\hat{\epsilon} \le \epsilon$. This corresponds to a rejection of at most $\epsilon N$ vectors among $N$-vectors in $X_1^m$.
Also, an optimal "$\epsilon$-efficient" universal classifier should satisfy the condition that
$\hat{d}(X_1^m, Z_1^N) = 1$ is satisfied by no more than $2^{N H_{\min}(N, X_1^m, \epsilon) + \epsilon}$ $N$-vectors $Z_1^N$.

Observe that in the case where $X$ is a realization of a finite-alphabet stationary ergodic
process, $\lim_{\epsilon \to 0} \lim_{N \to \infty} \limsup_{m \to \infty} H_{\min}(N, X_1^m, \epsilon)$ is equal almost surely to the
entropy-rate of the source and, for large enough $m$ and $N$, the classifier efficiently identifies
typical $N$-vectors without searching the exponentially large list of typical $N$-vectors, by replacing
the long training sequence with an "optimal sufficient statistics" that occupies a memory
of $O(N)$ only.

In the following, a universal context classifier for $N$-vectors with a storage-space complexity
that is linear in $N$ is shown to be essentially optimal for large enough $N$ and $m$.



Description of the universal classification algorithm: Assuming that $N$ divides $m$,
let $M = \frac{m}{N}$ and let $N'' = \lceil N^{1-2\epsilon} \rceil$.

A) Evaluate $H_{\min}(N, X_1^m, \frac{1}{2}\epsilon)$.

This step is carried out by generating an ordered list of the different $N$-vectors that
appear in $X_1^m$ according to their decreasing empirical probability
$P_{MN}(Z_1^N, N)$; $Z_1^N \in \mathbf{A}^N$ (see Glossary in Section B). Let $S_{\min}(N, X_1^m, \frac{1}{2}\epsilon)$ be the
smallest set of the most probable $N$-vectors such that
$P_{MN}[S_{\min}(N, X_1^m, \frac{1}{2}\epsilon)] \ge 1 - \frac{1}{2}\epsilon$.
Then $H_{\min}(N, X_1^m, \frac{1}{2}\epsilon) = \frac{1}{N} \log |S_{\min}(N, X_1^m, \frac{1}{2}\epsilon)|$.

B) First pass: gather all the contexts that are no longer than $t = \lceil (\log N)^2 \rceil$ and that
each appears at least $MN''$ times in the training sequence $X_1^m$, and generate a context
tree.

Second pass: Compute $H_u(N, N'', M)$ where $H_u(N, N'', M)$ is given by Eq (5), with
$N''$ replacing $K$. Let $T_u''(X_1^m)$ be the subset of contexts for which the minimisation
that yields $H_u(N, N'', M)$ is achieved. Clearly, $|T_u''(X_1^m)| \le N''$.

The computational complexity of steps A) and B) is $O(m)$ (using suffix tree methods
for the first pass and dynamic programming for the second pass [16], [17], [4]). Note
however that steps A) and B) above are preliminary pre-processing steps that are
carried out once, prior to the construction of the classifier that is tailored to the
training data $X_1^m$, and are not repeated for each test-sequence. The subset $T_u''(X_1^m)$ is the
"signature" of $X_1^m$ which, together with the quantity $H_{\min}(N, X_1^m, \frac{1}{2}\epsilon)$ and the
corresponding set of empirical probabilities $P_{MN}(x_1 | x_{-i+1}^0, N)$ for every $x_{-i+1}^0 \in T_u''(X_1^m)$,
is stored in the memory of the classifier (see Glossary in Section B). The storage
complexity is at most $O(N)$.

C) Let $x_{-i+1}^0$ denote a context in the set $T_u''(X_1^m)$. Compute

$$h_u(Z_1^N, X_1^m, T_u'', t) = -\sum_{x_{-i+1}^0 \in T_u''(X_1^m)} P_N(x_{-i+1}^0, t) \sum_{x_1 \in \mathbf{A}} P_N(x_1 | x_{-i+1}^0, t) \log P_{MN}(x_1 | x_{-i+1}^0, N),$$

where $P_N(x_1 | x_{-i+1}^0, t)$ and $P_N(x_{-i+1}^0, t)$ are derived from the test sequence $Z_1^N$.

D) Let $S(Z_1^N, \mu)$ be the set of all $\tilde{Z}_1^N \in \mathbf{A}^N$ such that $g(\tilde{Z}_1^N, Z_1^N) \le \mu$, where $g(*, *)$ is
some non-negative distortion function satisfying: $g(\tilde{Z}_1^N, Z_1^N) = 0$ iff $\tilde{Z}_1^N = Z_1^N$. Given
a test sequence $Z_1^N$, let

$$\Delta(Z_1^N, X_1^m, \mu) = \min_{\tilde{Z}_1^N \in S(Z_1^N, \mu)} \Bigl[ h_u(\tilde{Z}_1^N, X_1^m, T_u'', t) - \min\bigl[ H_u(t, N', M'), H_{\min}(N, X_1^m, \tfrac{1}{2}\epsilon) \bigr] \Bigr],$$

where here $H_u(t, N', M')$ is evaluated for $\tilde{Z}_1^N$, $N' = N^{1-\epsilon}$ and $H_u(t, N', M') = H_u(t, K, M')$
with $K = N'$ (see Glossary in section B).

Note that if $\mu > 0$, the number of computational steps that are involved in the
minimisation may grow exponentially with $N$.

Now set the particular classifier $\hat{d}(Z_1^N, \epsilon, X_1^m)$ to satisfy: $\hat{d}(Z_1^N, \epsilon, X_1^m) = 1$ iff
$\Delta(Z_1^N, X_1^m, \mu) \le \epsilon'$, where $\epsilon'$ is set so as to guarantee that $\hat{d}(X_1^m, \epsilon, X_{j+1}^{j+N}) = 1$ for
at least $(1 - \epsilon)(m - N + 1)$ instances $j = 0, 1, \ldots, m - N$ of $X_1^m$. If $H_u(t, N', M') + \epsilon' >
\log A$, set $\hat{d}(Z_1^N, \epsilon, X_1^m) = 1$ for every $Z_1^N \in \mathbf{A}^N$.
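Steps C) and D) can be sketched for the distortion-free case ($\mu = 0$) as follows. This is only an illustrative sketch under stated assumptions: the context set is supplied by the caller rather than selected by the minimisation of Eq (5), an empty context is added as a catch-all so that every position matches some context, and the value of $H_u(t, N', M')$ for the test sequence is passed in as a number instead of being computed. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def longest_context(seq, pos, contexts, max_len):
    """Longest member of `contexts` that is a suffix of seq[:pos]."""
    for k in range(min(max_len, pos), -1, -1):
        if seq[pos - k:pos] in contexts:
            return seq[pos - k:pos]
    raise AssertionError("unreachable: the empty context is always present")

def build_signature(train, contexts):
    """Store next-symbol frequencies of `train` per context (the 'signature')."""
    ctxs = set(contexts) | {""}  # assumption: empty context as a catch-all
    max_len = max(len(c) for c in ctxs)
    counts = defaultdict(Counter)
    for pos in range(len(train)):
        counts[longest_context(train, pos, ctxs, max_len)][train[pos]] += 1
    probs = {c: {s: n / sum(cnt.values()) for s, n in cnt.items()}
             for c, cnt in counts.items()}
    return ctxs, max_len, probs

def cross_entropy(test, signature):
    """Per-letter -log2 of the stored training probabilities along `test`."""
    ctxs, max_len, probs = signature
    total = 0.0
    for pos in range(len(test)):
        ctx = longest_context(test, pos, ctxs, max_len)
        p = probs.get(ctx, {}).get(test[pos], 0.0)
        if p == 0.0:
            return math.inf  # the training data never showed this transition
        total -= math.log2(p)
    return total / len(test)

def accept(h_cross, h_u_test, h_min, eps_prime):
    """Step D with mu = 0: accept iff Delta <= eps'."""
    return h_cross - min(h_u_test, h_min) <= eps_prime
```

Averaging over the positions of the test sequence weights each context by its empirical frequency in the test sequence, which is the role played by $P_N(x_{-i+1}^0, t)$ in the expression for $h_u$. Only the table of conditional frequencies is kept, not the training sequence itself.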



Refer to a test sequence $Z_1^N$ as being $\epsilon'$-acceptable (relative to $X_1^m$) iff $\Delta(Z_1^N, X_1^m, \mu) \le \epsilon'$.
It should be noted that for some small values of $N$, one may find values of $m$ for which
$H_{\hat{d}}(N, X_1^m, \epsilon)$ is much larger than $H_{\min}(X, \epsilon)$. It should also be noted that the space
complexity of the proposed classifier is $O(N)$ and that if no distortion is allowed (i.e. $\mu = 0$),
the time complexity of the proposed algorithm is linear in $N$ as well.
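The requirement that the classifier accept at least $(1 - \epsilon)(m - N + 1)$ of the training windows can be read as a calibration rule for the threshold. The sketch below is one possible empirical reading and is not taken from the paper: it assumes that the $\Delta$ value of every training window has already been computed and picks the smallest threshold that accepts the required number of them. The name `calibrate_threshold` is hypothetical.

```python
import math

def calibrate_threshold(deltas, eps):
    """Smallest threshold accepting at least (1 - eps) * len(deltas) values.

    deltas: Delta values of the training windows (assumed precomputed).
    """
    if not deltas or not 0 <= eps < 1:
        raise ValueError("need a non-empty list and 0 <= eps < 1")
    need = math.ceil((1 - eps) * len(deltas))
    return sorted(deltas)[max(need, 1) - 1]
```

A window is then accepted whenever its $\Delta$ does not exceed the returned value.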



Now set $\epsilon' = \frac{1}{2}\epsilon^2 + O(N^{-\epsilon})$. Then



Theorem 3 1) For any arbitrarily small positive $\epsilon$, the classifier that is described above
accepts no more than $2^{N H_{\hat{d}}(N, X_1^m, \epsilon)}$ $N$-vectors where, if $|S(Z_1^N, \mu)| \le 2^{N\epsilon''}$,

$$\limsup_{N \to \infty} \liminf_{m \to \infty} H_{\hat{d}}(N, X_1^m, \epsilon) \le H_{\min}(X, \tfrac{1}{2}\epsilon) + \tfrac{1}{2}\epsilon^2 + \epsilon''.$$

Observe that $H_{\min}(X, \frac{1}{2}\epsilon) \le H_{\min}(X, \epsilon) + \delta(\epsilon)$ where $\lim_{\epsilon \to 0} \delta(\epsilon) = 0$.

2) There exist $m$-sequences $X_1^m$ such that $H_{\hat{d}}(N, X_1^m, \epsilon)$ is much smaller than $\log A$ and
for which no classifier can achieve $H_{\min}(N', X_1^m, \epsilon) < \log A - \delta$ if $\log N' < \log N$, where $\delta$
is an arbitrarily small positive number.



Thus, for every $N \ge N_0(X)$, the proposed algorithm is essentially optimal for some
$m_0 = m_0(N, X)$ and is characterised by a storage-space complexity that is linear in $N$.
Furthermore, if one sets $\mu = 0$ (i.e. no distortion), the proposed algorithm is also characterised
by a linear time-complexity. The proof of Theorem 3 appears in the Appendix. Also, it
follows from the proof of Theorem 3 that if one generates a training sequence $X$ such that

$$\liminf_{M \to \infty} H_u(N, N, M) = \lim_{M \to \infty} H_u(N, N, M)$$

(i.e. a "stationary" training sequence), then there always exist positive integers $N_0$ and
$m_0 = m_0(N_0)$ such that the proposed classifier is essentially optimal for any $N > N_0(X)$
and for any $m > m_0(N_0)$, rather than for only some specific values of $m$ that depend on $N$.

Now, let $Y_1^N$ and $Z_1^N$ be two $N$-sequences and assume that no training sequence is available.
However, one still would like to test the hypothesis that there exists some test sequence $X$
such that both $N$-sequences are acceptable by this "essentially" $\epsilon$-efficient algorithm with
respect to $X$. (This is reminiscent of the "common ancestor" problem in computational
biology where one may think of $X$ as a training sequence that captures the properties of a
possible "common ancestor" [15] of two DNA sequences $Y_1^N$ and $Z_1^N$).



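Corollary 1 Let $Y_1^N$ and $Z_1^N$ be two $N$-sequences and let $S(Y_1^N, Z_1^N)$ be the union of all
their corresponding contexts that are no longer than $t = \lceil (\log N)^2 \rceil$ with an empirical
probability of at least $\frac{1}{N^{1-\epsilon}}$.

If there does not exist a conditional probability distribution

$$P(x_1 | x_{-i+1}^0); \quad x_{-i+1}^0 \in S(Y_1^N, Z_1^N)$$

such that,

$$\sum_{x_1 \in \mathbf{A}} P_{N,Y_1^N}(x_1 | x_{-i+1}^0, t) \log \frac{P_{N,Y_1^N}(x_1 | x_{-i+1}^0, t)}{P(x_1 | x_{-i+1}^0)} \le \epsilon$$

and at the same time,

$$\sum_{x_1 \in \mathbf{A}} P_{N,Z_1^N}(x_1 | x_{-i+1}^0, t) \log \frac{P_{N,Z_1^N}(x_1 | x_{-i+1}^0, t)}{P(x_1 | x_{-i+1}^0)} \le \epsilon$$

(where $P_{N,Y_1^N}(x_1 | x_{-i+1}^0, t) = P_N(Y_1 | Y_{-i+1}^0, M')$ is empirically derived from $Y_1^N$ and
$P_{N,Z_1^N}(x_1 | x_{-i+1}^0, t) = P_N(Z_1 | Z_{-i+1}^0, M')$ is empirically derived from $Z_1^N$, see
Glossary in section B), then there does not exist a training sequence $X$ such that for some
positive integer $m$ for which $H_{\min}(N, X_1^m, \frac{1}{2}\epsilon) > \epsilon$ and
$H_{\hat{d}}(N, X_1^m, \epsilon) \le H_{\min}(N, X_1^m, \frac{1}{2}\epsilon) + \frac{1}{2}\epsilon^2$ and at the same time both $Y_1^N$ and $Z_1^N$ are
$\epsilon'$-acceptable relative to $X_1^m$. Here, the condition $H_{\min}(N, X_1^m, \frac{1}{2}\epsilon) > \epsilon$ guarantees that
$X_1^m$ is not a degenerated training sequence and is exponentially "rich" in distinct $N$-vectors.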
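The condition of Corollary 1 asks whether one conditional distribution can be close, in divergence, to the empirical conditionals of both sequences. The following minimal sketch illustrates the test for a single context only. It assumes that the two empirical next-letter distributions are given as equal-length lists of probabilities, and it restricts the search for the common distribution to a grid of mixtures of the two, which is an assumption of this sketch rather than a procedure stated in the paper. The names `kl_bits` and `common_model_exists` are hypothetical.

```python
import math

def kl_bits(p, q):
    """Divergence D(p || q) in bits; infinite if q misses a letter that p uses."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:
            if qi == 0.0:
                return math.inf
            total += pi * math.log2(pi / qi)
    return total

def common_model_exists(p_y, p_z, eps, steps=100):
    """True if some mixture of p_y and p_z is within eps bits of both."""
    for k in range(steps + 1):
        lam = k / steps
        mix = [lam * a + (1 - lam) * b for a, b in zip(p_y, p_z)]
        if max(kl_bits(p_y, mix), kl_bits(p_z, mix)) <= eps:
            return True
    return False
```

Two deterministic but opposite conditionals, `[1.0, 0.0]` and `[0.0, 1.0]`, are each one bit away from their even mixture, so in this sketch they are compatible with a common model only when `eps` is at least 1.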



In conclusion, it should be noted that it has not been claimed that the particular "test- 
bench" algorithm that was introduced here as a theoretical tool to establish Theorem 3, 
yields the smallest possible computational time complexity. In unison with the "individual 
sequence" justification for the essential optimality of Context-tree universal data compres- 
sion algorithm that was established above, these results may contribute a theoretical "indi- 
vidual sequence" justification for the Probabilistic Suffix Tree approach in learning and in 
computational biology [7], [8].

Appendix



Proof of Proposition 1: By definition, 

~M-2 



AT— 1 I 

PL (X,N,M) >nuK — 



E 

j=0 



i+(j+l)JV-l 



AT 



= Tj E p mn(Z?,N)L{Z?)>H mn {N) (9) 



which leads to Proposition 1 by the Kraft inequality. 



Proof of Lemma 1: Let $N_0$, $M_0$ and $M$ be positive numbers and let $\epsilon = \epsilon(M)$ be an
arbitrarily small positive number, satisfying $\log M_0 > N_0 \log A$, $H_u(X) \ge H_u(X, N_0) - \epsilon$,
and $M \ge N_0^2$ such that $H_u(M_0 N_0, M_0 N_0, M) \ge H_u(X) - \epsilon$, where $H_u(M_0 N_0, M_0 N_0, M) =
H_u(M_0 N_0, K, M)$ with $K = M_0 N_0$ (see Glossary in section B).



Note that for any vector z^° N ° the block-length is MqNq. Thus, the parameter t that 



determines K^Z^^K) here is t = [(log M N )' 2 ] (see section B). Thus, t > JV£ > JV - 



Therefore, by the properties of the entropy function, by applying the chain-rule to Hmn (Nq, MqNq) 
and by Eq (5), 

H MMoNo (N ,M N ) > H MMoNo (Z N \Z^-\M N ) > H U (M N ,M N ,M) - ^ 



>H u (X,N )-2e- 



logA 



where by Eq (4), the term is an upper-bound on the total contribution to Hmm n (Zn\Zi° l , MoNq) 
by vectors Zf " 1 for which PmMoN q (Z^-\ M q N ) < ^ 



rN -l 



M N < 1 



Now, \Pmm n Pmm q n (Z 1 °,Nq)\ < MMqNo — 

in [5, page 33], for any two probability distributions P(Z^°) and Q(Z^°), 



d]y • By Lemma 2.7 



E ^>^3|* 



< d\og 



A N 



where d = max^iv N 



.. 4 „ o \P(Z^°) - Q(Z?°)\. Hence, by letting d No play the role of d in the 
following expression, 



Hmm n (Nq, M N ) — HmmqNq (No, Nq) 



< — s 
" N 2 l 



Af logA + 21ogiVo 
and therefore,

HmMoNo (N ,N )>H u (X,N )-2e--^ \n log A + 2 log 7V 1 - ^ 
which, by Eqs (6) and (7) and by setting e = proves Lemma 1 (Eq (8)). 

Proof of Theorem 1: Consider the following construction of $X_1^{NM}$: Let $h$ be an arbitrary
small positive number and $\ell$ be a positive integer, where $h$ and $\ell$ satisfy $N = \ell 2^{h\ell}$, and
assume that $\ell$ divides $N$.

1) Let $S_{\ell,h}$ be a set of some $T' = \frac{N}{\ell} = 2^{h\ell}$ distinct $\ell$-vectors from $\mathbf{A}^\ell$.

2) Generate a concatenation of the $T'$ distinct $\ell$-vectors in $S_{\ell,h}$.

3) Return to step 2 for the generation of the next $N$-block.
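The three construction steps can be sketched in a few lines. The sketch assumes a small alphabet given as a string, a value of $h$ for which $2^{h\ell}$ is an integer, and a seeded pseudo-random choice of the distinct $\ell$-vectors (the construction itself allows any choice). The function name `build_sequence` is hypothetical.

```python
import itertools
import random

def build_sequence(alphabet, ell, h, m_blocks, seed=0):
    """Return (sequence, chosen): m_blocks identical N-blocks, N = ell * 2**(h*ell)."""
    t_prime = round(2 ** (h * ell))
    if abs(2 ** (h * ell) - t_prime) > 1e-9:
        raise ValueError("2**(h*ell) must be an integer in this sketch")
    if t_prime > len(alphabet) ** ell:
        raise ValueError("not enough distinct ell-vectors over this alphabet")
    rng = random.Random(seed)
    all_vectors = ["".join(v) for v in itertools.product(alphabet, repeat=ell)]
    chosen = rng.sample(all_vectors, t_prime)  # step 1: T' distinct ell-vectors
    block = "".join(chosen)                    # step 2: one N-block
    return block * m_blocks, chosen            # step 3: repeat the same N-block
```

With the binary alphabet, `ell = 4` and `h = 0.5`, each block holds four distinct 4-vectors, so $N = 16$.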

Now, by construction, the M consecutive iV-blocks are identical to each other. Hence, 
p L {X,N,M) > j^j and by Eq (1), 

y > PmnAZLN) > 1, 2, ...N. 

Thus, by construction, Pmn{Z[,N) > ^fj^- ■ Furthermore, there exists a positive integer 
N = N (h) such that for any N > N , 

log N 

Hmn(£,N) < ~^<^h 

where 

H MN {l,N) = -- PMN(Z(,N)\ogP M N(Zi,N). 
z»eA N 

Observe that any vector Zj ^i + 1 < j < MN;1 < i < £ — 1, except for a subset of 
instances j with a total empirical probability measure of at most ^ef, is therefore a suffix 
of Z{ . where K = Nj^ an d that K^X? ,K) < t for any N > N (h), where 

$t = \lceil (\log N)^2 \rceil$. Thus, by applying the chain-rule to $H_{MN}(\ell, N)$, by the convexity of the
entropy function and by Eq (5),

$$H_u(N, K, M) \le H_{MN}(Z_1 | Z_{-\ell+1}^0, N) \le H_{MN}(\ell, N) \le 2h \qquad (10)$$

Also,

$$\limsup_{M \to \infty} H_u(N, K, M) = \limsup_{M \to \infty} H_u(K, K, M - 1) = H_u(X, N) \le 2h$$

(see Glossary in section B).

Consider now the class $\sigma_{\ell,h}$ of all sets like $S_{\ell,h}$ that consist of $2^{h\ell}$ distinct $\ell$-vectors. The
next step is to establish that no compression for $N$-sequences which consist of the $2^{h\ell}$
distinct $\ell$-vectors that are selected from some member in the class $\sigma_{\ell,h}$ is possible, at least
for some such $N$-sequences.

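Let the normalised length-function $\bar{L}(Z_1^N)$ be defined by:

$$\bar{L}(Z_1^N) = -\log \frac{2^{-L(Z_1^N)}}{\sum_{Z_1^N \in \mathbf{A}^N} 2^{-L(Z_1^N)}}.$$

Clearly, $\bar{L}(Z_1^N) \le L(Z_1^N)$ since $L(Z_1^N)$ satisfies the Kraft inequality while $\bar{L}(Z_1^N)$ satisfies it
with equality, since $2^{-\bar{L}(Z_1^N)}$ is a probability measure. Then,

$$L(Z_1^N) \ge \bar{L}(Z_1^N) = \sum_{i=0}^{\frac{N}{\ell} - 1} \bar{L}(Z_{i\ell+1}^{(i+1)\ell} | Z_1^{i\ell})$$

where

$$\bar{L}(Z_{i\ell+1}^{(i+1)\ell} | Z_1^{i\ell}) = \bar{L}(Z_1^{(i+1)\ell}) - \bar{L}(Z_1^{i\ell})$$

is a (normalised) conditional length-function that, given $Z_1^{i\ell}$, satisfies the Kraft inequality
with equality, since $2^{-\bar{L}(Z_{i\ell+1}^{(i+1)\ell} | Z_1^{i\ell})}$ is a conditional probability measure.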
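The effect of normalising a length-function can be checked numerically. The sketch below assumes the usual normalisation, in which each $2^{-L}$ is divided by the Kraft sum so that the result is a probability measure; this form is an assumption made for the illustration, and the function name `normalise_lengths` is hypothetical.

```python
import math

def normalise_lengths(lengths):
    """Map codeword lengths L to L + log2(Kraft sum), so that 2**-L sums to 1."""
    kraft = sum(2.0 ** -l for l in lengths.values())
    return {word: l + math.log2(kraft) for word, l in lengths.items()}
```

For lengths 1, 2 and 3 the Kraft sum is 7/8, so every normalised length is shorter than the original one by $\log_2(8/7)$ bits, and the normalised lengths satisfy the Kraft inequality with equality.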

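Lemma 2 For any $h > 0$, any arbitrarily small positive number $\delta$, any $N \ge N_0 = N_0(\ell, h)$
and any $\bar{L}(x_1^\ell | x_{-N+1}^0)$ there exists a set of $2^{h\ell}$ $\ell$-vectors such that

$$\sum_{x_1^\ell \in \mathbf{A}^\ell} P_N(x_1^\ell, \ell) \bar{L}(x_1^\ell | x_{-N+1}^0) \ge \ell(1 - \delta)(\log A - \delta)$$

for all $x_{-N+1}^0$ which are concatenations of $\ell$-vectors from $S_{\ell,h}$ as described above.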



Proof of Lemma 2: The number of possible sets $S_{\ell,h}$ that may be selected from the $A^\ell$
$\ell$-vectors over $\mathbf{A}$ is:

$$M_{\ell,h} = \binom{2^{(\log A)\ell}}{2^{h\ell}}.$$
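The size of this binomial coefficient can be examined numerically. The sketch below computes $\log_2 M_{\ell,h}$ exactly for small parameters and compares it with the standard binary-entropy bounds $\frac{2^{n h(k/n)}}{n+1} \le \binom{n}{k} \le 2^{n h(k/n)}$. The function names are hypothetical, and $2^{h\ell}$ is rounded to the nearest integer as an assumption of the sketch.

```python
import math

def binary_entropy(p):
    """h(p) = -p log2 p - (1 - p) log2 (1 - p), with h(0) = h(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def log2_num_sets(alphabet_size, ell, h):
    """Return (log2 of the number of candidate sets, n, k)."""
    n = alphabet_size ** ell      # number of ell-vectors over the alphabet
    k = round(2 ** (h * ell))     # number of vectors in each set
    return math.log2(math.comb(n, k)), n, k
```

For a binary alphabet with `ell = 10` and `h = 0.5`, sets of 32 vectors are drawn from 1024 candidates.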



Given a particular L(x\ \x°_ N+1 ), consider the collection M^^o of all sets Sg^s that 
consist of at most (1— 5)2 he vectors selected from the set of vectors x[ for which L(x\\x°_ N+1 ) > 
(log A - 8)1 (observe that there are at least 2 logM - 2 ( - lo ^ A ^ 5 ^ e such vectors). 



The collection M.g^ h ^\ x _ N+1 is referred to as the collection of "good" sets Sg^,s (i-e. sets 
yielding L(x{\x°_ N+1 ) < (log A — 8)£). 
It will now be demonstrated that



^x _ N+1 eA* M e,h,S\x°_ 



is exponentially smaller than 



M£ t h if N < S2 M (1 — h)t. Hence, for any conditional length- function L(x[ \x°_ N+1 ) and any 
x°_ jv+i) ^ A^, most of the sets Sg^ €E will not contain a "good" Sg^s G M^fc,5|x 
and therefore less than <52 w ^-vectors out of the 2 W ^-vectors in Sg^ will De associated with 
an L(x{\x°_ N+1 ) < {log A- 5)1. 

The cardinality of M^^o^ is upper bounded by: 2^ j=S2 u { )\ j )• 

Now, by [3], one has for a large enough positive integer n, 

where h(p) = —plog 2 p — (1 — p) log 2 (l — p) and where lim n ^oo e(n) = 0. 
Thus, 

where limjv^oo e'(iV) = 0. Therefore, if N < 52 M (l — h)£, there exists some Sg^ for which, 

£ 2- w L(xi|x°_^ +1 ) > £(1 - <f)(log A - <5) 

for all N- vectors G A N . 

Hence by construction, there exists some Sg^ for which, 



L(x?)>L(x?)= Y,L(x^ e \x\ e )>N[(l-S)logA-5) + e'(N) 



i=0 



Therefore, it follows that the class Cjv ,m ,<5 is not empty since, by construction, the class 

Se^h of cardinality Mg^ = ( 2( 2 w ' ) of sequences is is included in Cn ,m ,8 for /i = | and for 
I that satisfy Nq = 12 M . Moreover, by Lemma 1, it follows that every sequence X is in the 
set C Noi m ,8, for large enough N and M = M (N ). 

This completes the proof of Lemma 2 and setting h = 5, the proof of Theorem 1. 



Proof of Theorem 3: Consider the one-to-one mapping of $Z_1^N$ with the following
length-function (following the definition of $L(X_1^N)$ in the proof of Theorem 2 in section B
above):

1) $L(Z_1^N) = 2 + N[H_u(t, N', M') + O((\log N)^2 N^{-\epsilon})]$ if $H_u(t, N', M') \le H_{\min}(N, X_1^m, \frac{1}{2}\epsilon)$
and if $Z_1^N \in S_{d^*}(N, X_1^m, \frac{1}{2}\epsilon)$, where $H_u(t, N', M')$ is evaluated for $Z_1^N$.

2) $L(Z_1^N) = 2 + N[H_{\min}(N, X_1^m, \frac{1}{2}\epsilon)]$ if $H_u(t, N', M') > H_{\min}(N, X_1^m, \frac{1}{2}\epsilon)$ and if
$Z_1^N \in S_{d^*}(N, X_1^m, \frac{1}{2}\epsilon)$.

3) $L(Z_1^N) \le 2 + N[H_u(t, N', M') + O((\log N)^2 N^{-\epsilon})]$ otherwise.
Now 



m-N 

m _ ]vTT L ( X j+l) - h mn(n) 



3=0 

For any N > Nq and some M > Mo = Mq(Nq) (see Eq (6) and Glossary in section B), 

H U (N,N",M) <H MN (N) + \e 2 

Thus, 

ro-AT 



Also, by construction (see section C),

$$h_u(Z_1^N, X_1^m, T_u'', t) = -\sum_{z_{-i+1}^0 \in T_u''(X_1^m)} P_N(z_{-i+1}^0, t) \sum_{z_1 \in \mathbf{A}} P_N(z_1 | z_{-i+1}^0, t) \log P_{MN}(z_1 | z_{-i+1}^0, N)$$

where $P_N(z_1 | z_{-i+1}^0, t)$ and $P_N(z_{-i+1}^0, t)$ are derived from the test sequence $Z_1^N$.

Let Patj (xl i+1 , i) denote Pjv(xLj+i5*) where for < j < m — i, Pjv(xi i+1 ,t) is derived 
from the substring Xj^ of Xf. 

Thus, < |^^+t T,T=o N PN,j{xl i+l ,t) - P M N{xl i+1 ,t)\ < due to end-effects (see 

introduction). Also note that log Pm n (xi \x^_ i+1 , N) < logX for every x*_ i+1 € T'^XJ 71 ). 
Hence, it follows that 

ZT=o N hup*?, Xr, T tt " , t) < H U (N, X" , M) + 0(«) 



Thus, 



^K(xji jv ,xr,r u ",t)-L((x|+f)] <^ 2 +o( 



[logX] £ 



m - X + 1 ^ L ^ ^ ' 1 ' " ' ' VV J+l /J - 4 " ■ - V / 

j=0 

Let,

$$\Delta(Z_1^N, X_1^m) = h_u(Z_1^N, X_1^m, T_u'', t) - H_u(t, N', M')$$

where $H_u(t, N', M')$ is evaluated for $Z_1^N$.

Let $T_u'(Z_1^N)$ be a set that consists of all contexts $x_1^{i-1}$ that appear in $Z_1^N$, satisfying
$P_N(x_1^{i-1}, t) \ge \frac{1}{N'}$; $i \le t$, where $N' = N^{1-\epsilon}$.

Note that $\Delta(Z_1^N, X_1^m)$ is similar to a divergence measure: $\Delta(Z_1^N, X_1^m) + O(N^{-\epsilon}) \ge 0$, where
the term $O(N^{-\epsilon})$ is an upper-bound on the relative frequency of instances in $Z_1^N$ that have
as a context a leaf of $T_u'(Z_1^N)$ that is a suffix of a leaf of $T_u''(X_1^m)$ and is therefore not
included in the set of contexts that achieve $H_u(N, N'', M)$, where $H_u(N, N'', M)$ is derived
from $X_1^m$.

Thus, since L(X?) < H u (t,N',M')) + % + 0((log N) 2 N- e ), also 

h u (X?,XF,T u »,t) - L((X?) + 0((log) 2 iV-) + A > 

Therefore, 

m—N 

m _ N + l E [^(^f ,^r,T u ",t) - L((Xj+f )] + 0((logiV) 2 iV-) + - < -(e) 2 

3=0 



Note that L(Xj^) = mm[H min (N, X? , ±e), fl^t, TV', M')]+0((log A0 2 A^ e )++f if € 
S*d* (AT, Xf 1 , ^e), which is valid for m(l — ^e) instances j in X™, where H u (t,N',M') is 
evaluated for . 

Assume that all the other instances j are rejected by the classifier 
(i.e. d(X?,e, ' 

Statement 1) in Theorem 3 then follows by the Markov inequality for non- negative random 
variables and by setting e' = \e 2 + 0((log N) 2 N~ e ), where (ignoring the effect of the term 
0(N^ t ) for large N) there are at most \em instances j with Z^j G S^* (N, X™, \e) are 
rejected, on top of \em instances where Z^^ is not in instances era as required from 
an e-efncient classifier. Also, since h U(l {Z^ , X™ ,T U " ,t) is a length function, the classifier 
accepts no more than 2 Hmin ( N ' X ™'^ + ¥ 2+ °( N ~ € ") +e " where the term e" is due to the fact that 
for ji > 0, the discussion is limited to cases where \S(Z^ , < 2 Ne " . 

Consider the class of sequences $X$ that are generated in the proof of Theorem 1 above. It
was demonstrated there that for every such individual sequence, $H_u(X, N) \le 2h$ where $h$ is an
arbitrarily small positive number. Thus, there exists an $m$ such that
$H_{\hat{d}}(N, X_1^m, \epsilon) \le 2h + \frac{1}{2}\epsilon^2 + \epsilon''$.

Now, assume that the classifier is successively fed with every $Z_1^{N'} \in \mathbf{A}^{N'}$ and consider the
list of all the $N'$-vectors that are accepted (i.e. $Z_1^{N'}$: $\hat{d}(Z_1^{N'}, X_1^m, \epsilon) = 1$) and a compression
algorithm that assigns a length-function that is equal to $N' H_{\hat{d}}(N', X_1^m, \epsilon) + 1$ to each of
the accepted vectors and $N' \log A + 1$ to any of the rejected $N'$-vectors. This results in a
compression of $H_{\hat{d}}(N', X_1^m, \epsilon) + \epsilon + \frac{1}{N'}$.
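The compression algorithm described above is a two-part code: one flag bit, followed either by an index into the list of accepted vectors or by the raw letters. A minimal sketch follows, assuming that the accepted list is given explicitly, that vectors are Python strings, and that lengths are rounded up to whole bits. The function name `make_codec` is hypothetical.

```python
import math

def make_codec(accepted, alphabet):
    """Return (encode, decode) for the flag-plus-index / flag-plus-raw code."""
    accepted = sorted(accepted)
    index = {v: i for i, v in enumerate(accepted)}
    idx_bits = max(1, math.ceil(math.log2(len(accepted)))) if accepted else 0
    sym_bits = max(1, math.ceil(math.log2(len(alphabet))))
    sym_index = {s: i for i, s in enumerate(alphabet)}

    def encode(z):
        if z in index:  # accepted: flag 1, then the index in the accepted list
            return "1" + format(index[z], f"0{idx_bits}b")
        # rejected: flag 0, then every letter with a fixed number of bits
        return "0" + "".join(format(sym_index[s], f"0{sym_bits}b") for s in z)

    def decode(bits):
        if bits[0] == "1":
            return accepted[int(bits[1:], 2)]
        body = bits[1:]
        return "".join(alphabet[int(body[i:i + sym_bits], 2)]
                       for i in range(0, len(body), sym_bits))

    return encode, decode
```

An accepted vector costs one bit plus the logarithm of the size of the accepted list, while a rejected vector costs one bit plus its uncompressed length, which mirrors the two codeword lengths used in the argument.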

Following the proof of Lemma 2 that led to the proof of Theorem 1 above, observe
that if a "signature" of $N'$ bits that is generated from $X_1^m$ is made available to a universal
data-compression algorithm for $N'$-vectors, Theorem 1 still holds for some of the sequences
that are generated as described above, if $\log N' < \log N$.

This leads to statement 2) of Theorem 3 and completes the proof of Theorem 3.